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ABSTRACT 

High-throughput sequencing of cDNA libraries con- 
structed from cellular RNA complements (RNA-Seq) 
naturally provides a digital quantitative measure- 
ment for every expressed RNA molecule. Nature, 
impact and mutual interference of biases in different 
experimental setups are, however, still poorly 
understood— mostly due to the lack of data from 
intermediate protocol steps. We analysed multiple 
RNA-Seq experiments, involving different sample 
preparation protocols and sequencing platforms: 
we broke them down into their common— and 
currently indispensable— technical components 
(reverse transcription, fragmentation, adapter liga- 
tion, PCR amplification, gel segregation and 
sequencing), investigating how such different 
steps influence abundance and distribution of the 
sequenced reads. For each of those steps, we 
developed universally applicable models, which 
can be parameterised by empirical attributes of 
any experimental protocol. Our models are imple- 
mented in a computer simulation pipeline called 
the Flux Simulator, and we show that read distribu- 
tions generated by different combinations of these 
models reproduce well corresponding evidence 
obtained from the corresponding experimental 
setups. We further demonstrate that our in silico 
RNA-Seq provides insights about hidden precursors 
that determine the final configuration of reads along 
gene bodies; enhancing or compensatory effects 
that explain apparently controversial observations 
can be observed. Moreover, our simulations 



identify hitherto unreported sources of systematic 
bias from RNA hydrolysis, a fragmentation tech- 
nique currently employed by most RNA-Seq 
protocols. 

INTRODUCTION 

Read abundances from RNA-Seq experiments reflect the 
quantities of different RNA molecules in the interrogated 
transcriptome (1). It is commonly accepted that gene ex- 
pression profiles exhibit a similar shape across evolution- 
ary distant organisms and functionally diverse cell types. 
Observations based on expressed sequence tags (2) show 
that most transcripts are rare, some are moderately 
abundant and only a small portion is very abundant. 
Such unbalanced distribution can be modelled using 
Zipf s Law (3) which exhibits a characteristic linear behav- 
iour in log-log (4). Furusawa and Kaneko (2) link the 
reason behind this observation to general thermodynamic 
diffusion constants that determine power law distributions 
in a large spectrum of biomolecules, whereas Ogasawara 
et al. (5) propose an evolutionary model. 

However, experimental protocols are increasingly 
reported to generate deviations from the expected read 
distributions (6-8). Since the first ultra sequencing experi- 
ments on cellular transcriptomes (9,10), sample prepar- 
ations for so-called RNA-Seq have evolved in multiple 
respects and generated a considerable repertoire of proto- 
cols, which however all stem from a common set of elemen- 
tary components. First and foremost, because all current 
sequencing technologies can only handle DNA substrates, 
reverse transcription (RT) of RNA into cDNA has to be 
accomplished. In the first protocols to be proposed for 
library preparation, RT constituted the initial step, 
involving either poly-<fT (for poly-A+ transcriptomes) 
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or random primers (usually hexamers) to initiate first- 
strand synthesis. Poly-<fT oligomers primarily bind in 
the region of the poly-A tail, which — especially for long 
transcripts — can result in template loss during RT of the 
entire molecule and thereby cause a loss of 5'-end infor- 
mation (9). Randomly primed first-strand synthesis of 
full-length RNA molecules, in contrast, can lead to a 
relative over-representation of the 5'-end information 
(11). To diminish RT-related biases, RNA-Seq concepts 
have changed towards protocols that postpone RT after 
fragmentation, which seems to prevent a strong bias of 
read abundances towards the 3'-end (1). 

Second, fragmentation of transcripts is necessary 
because current sequencing platforms produce relatively 
short tags from the ends of much longer DNA molecules. 
Therefore, any attempt to sequence non-fragmented RNA 
populations would result in reads that exclusively repro- 
duce the ends of transcripts. First RNA-Seq protocols 
relied on fragmentation by restriction enzymes (e.g. 
Nlalll or DpnII) to cleave reversely transcribed cDNA 
(9,12). Due to the sequence specificity of each restriction 
enzyme, however, the number of fragments produced by 
enzymatic digestion is not directly comparable between 
transcripts of different sequence compositions; for 
instance, ~4% of known Drosophila melanogaster genes 
do not exhibit a Nlalll-recognition site (13), and even 
degradation by the endonuclease DNAsel — so far con- 
sidered as unspecific — has recently been reported to 
exhibit strong sequence-selective characteristics (7). 
Therefore, efforts were soon directed towards the devel- 
opment of sequence-independent 'random' fragmentation 
protocols (13), at first by employing nebulisation, the 
physical shearing of cDNA molecules in a liquid 
medium (14). Although being cost-efficient and effective, 
nebulisation has been criticised for its inability to 
fragment DNA chains shorter than ~700-800nt and for 
producing suboptimal fragment size distributions when 
subsequent size selection steps are present (15). 
Consequently, current RNA-Seq protocols implement 
fragmentation by the controlled hydrolysis of RNA — 
usually catalysed by heat and acetate-complexed Mg 2+ 
or Zn 2+ ions (11) — which is generally considered to 
produce uniformly distributed fragments (1). 

After RT, during the 'final library preparation', adapter 
sequences are ligated to both sides of the double-stranded 
DNA molecules, which mediate the binding of fragments 
to beads and harbour primer-binding sites for amplifica- 
tion. Randomly primed RT (7) and/or the adapter ligation 
process (16,17) promote sequence-selective biases which 
manifest as motifs at the fragment ends (7,8); promising 
RNA-ligation based protocols avoiding both steps have 
been demonstrated (17,18). Before sequencing, fragments 
of the primary library are often amplified by a polymerase 
chain reaction (PCR), because the most cost-efficient 
sequencing platforms to date do not accept single 
molecule substrates. Amplification efficiency is known to 
depend on the GC content of the respective molecule (17), 
although controversial reports on the correlation between 
GC content and RNA-Seq coverage have been published 
(6,17,19). Leading high-throughput technology providers 
therefore suggest a size selection step in order to keep 



amplification biases under control by making fragment 
lengths homogeneous: e.g. 300-1000 nt long fragments 
are recommended for the Roche's pyrosequencer (20), 
and 200nt±25nt are usually suggested for Illumina 
sequencing experiments. Size selection in general is imple- 
mented by gel electrophoresis, which suffers from artefacts 
like molecule aggregates (21). 

Finally, the 'sequencing' step obtains one arbitrary end 
(single reads) or both ends (paired-end reads) of the 
cDNA fragments in the library. Read sequences undergo 
modifications according to the technical limitations of the 
corresponding platform, e.g. insertions/deletions (indels) 
typically occur in reads produced by Roche pyro- 
sequencing (22), whereas Illumina sequencing platforms 
mainly exhibit read sequences with an increased rate of 
nucleotide substitutions — and a correspondingly 
decreased quality — towards the read end (6). Additionally, 
the interplay between sequencing chemistry, sequencer 
machine calibration and the base calling algorithm 
employed during the downstream analysis of raw data 
determine subtle preferences in the so-called 'crosstalk', 
i.e. the misrecognition of chromophore-marked nucleo- 
tides (23). 

MATERIALS AND METHODS 

Simulation of different fragmentation processes 
Enzymatic digestion 

In our implementation of in silico enzymatic digestion, 
position weight matrices are employed to capture the 
sequence selectivity of the corresponding enzyme (e.g. 
Nlalll or DpnII) and fragmentation points of cDNA mol- 
ecules are determined by importance sampling. 

Nebulisation 

According to preliminary modelling attempts (24), poten- 
tial breakpoints are distribtuted as a Gaussian function 
around the molecules' midpoints and the breaking prob- 
ability is drawn from an exponential of the ratio between 
the fragment size and the limiting size below which mol- 
ecules are unlikely to be broken any further (A. = 700 nt 
for cDNA, Supplementary Methods). 

Hydrolysis 

Previously published models of hydrolysis are based on 
the assumption that fragment sizes produced by uniformly 
random fragmentation of molecules with the same length 
fall along a characteristic Weibull distribution, if the decay 
rate depends on molecule size (25). Here we propose a 
model for transcript populations of heterogeneous 
lengths, where we empirically derive a logarithmic depend- 
ence of the Weibull shape-parameter on the molecule's 
length (see Supplementary Methods and Results section). 

Simulation of reverse transcription 

We model RT separately for first- and second-strand syn- 
thesis. The start point depends on the priming strategy 
(i.e. parameters Poly-</r or random) and optionally by a 
position weight matrix (PWM) describing the sequence 
bias. The primer extension length is parameterised by 
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the minimum (RT m { n ) and maximum length (RT m!lx ) of the 
expected cDNA molecules (Supplementary Methods). 

Simulated size selection 

As for the fragment sizes observed after gel electrophor- 
esis, the Flux Simulator accepts parameterised normal 
distributions or empirical distributions. Fragments are 
subsampled according to such distributions, either by 
acceptance-rejection sampling or by the Metropolis- 
Hastings algorithm (26,27). 

Simulated adapter ligation and PCR amplification 

We simulate the reaction kinetics of the adapter ligation 
process — reflected by motifs of sequences that are 
preferred by the involved enzymes — as Bernoulli trials 
parameterised by a PWM representing the sequence 
bias. PCR-amplification is sensitive to the GC content of 
the amplified DNA stretches and in agreement with 
previous observations (17), we model PCR-efficiency as 
a quantity distributed normally about a GC-optimum 
(Supplementary Methods). 

Simulated sequencing 

During in silico sequencing, the fragments in the library 
are subsampled and the sequence of either an arbitrary 
end for single reads, or of both ends for paired mates, is 
obtained. The number of reads and their length may be 
specified; however, there cannot be more reads than the 
number of fragments in the library, nor can any read be 
longer than the fragment it comes from. The orientation 
of the reads is characteristic in sequencing-by-synthesis 
protocols (13,17) due to an intrinsic attribute of polymer- 
ases progressing strictly from 3' to 5' along the template 
(Supplementary Figure SI). For lllumina chemistry, we 
additionally implemented a quality-based error model 
(Supplementary Figure S2). 

Simulated gene expression 

In agreement with preliminary observations (2,5), our 
analysis of the reference data sets demonstrates that 
gene expression profiles estimated from RNA-Seq data 
exhibit an about linear shape in log-log space up to the 
first thousands of gene expression ranks (Supplementary 
Figure S3). By non-linear fitting to the experimental data, 
we deduced a modified Zipf s Law, which we employ to 
assign randomised expression levels to genes and tran- 
scripts in our simulations (Supplementary Methods). 

In the Flux Simulator, we also include the simulation of 
two biologically relevant modifications of annotated tran- 
scripts: transcripts with the same splicing structure, i.e. 
identical configuration of introns that are removed 
during the processing of nascent RNA, still may vary in 
their precise transcription start site and in the length of 
their poly-A tail (Supplementary Methods). These features 
can have a significant impact on the physical attributes of 
the corresponding molecules, playing an important role 
during library preparation. 



Data source and basic processing 

For our analysis, we employed publicly available read 
data (Supplementary Methods) from: Saccharomyces 
cerevisiae (9), Arabidopsis thaliana (28), Mus musculus (1 1), 
the same Homo sapiens sample sequenced with two differ- 
ent RNA-Seq protocols, i.e. flowcell RT-Seq (FRT) and 
standard hydrolysis (STD) protocol (17), and RNA control 
sequences spiked-in in high concentrations (29). In a first 
step, we mapped and split-mapped non-redundantly all 
the reads to the respective reference genome sequence 
using the GEM library (http://sourceforge.net/projects/ 
gemlibrary); in the case of the cress data set, which is com- 
paratively small, we also considered additional read 
mappings with long indels obtained with BLAT (30). 

Subsequently, we focused on the distribution of reads 
that map to transcripts without alternatively processed 
forms. To define such transcripts, we considered a 
standard reference annotation of the transcriptome, i.e. 
the SGD annotation for yeast (31), the TAIR annotation 
for cress (32) and the murine as well as the human RefSeq 
annotation (33). This procedure provided us with 
mappings for 6606768 reads (47%) from yeast, 351 336 
reads (65%) from cress and for 21 359 481 reads (68%) 
from mouse, and with 530996 reads that map in proper 
pairs to the spike-in control sequences. Due to substan- 
tially different data set sizes (90 million versus 13 million 
reads), in the case of the human FRT- and the STD-Seq 
experiments, we extracted subsets of reads of suitable size 
before mapping to ensure comparability (Supplementary 
Table SI). 



RESULTS 

Overview of the Flux Simulator RNA-Seq pipeline 

We implemented a computer pipeline for simulating 
RNA-Seq experiments — which we call the Flux 
Simulator — comprising explicit models for the processes 
that determine abundance and distribution of read tags 
according to the specified experimental protocol (see 
Figure 1 and Methods section). Starting from a genomic 
sequence for a certain species and a corresponding anno- 
tation of gene structures, the first step of this pipeline is, in 
fact, a transcriptome simulation (Figure 1A) where — if no 
pre-defined cell expression profile is available — annotated 
genes and transcripts are assigned randomised expression 
levels according to the general laws of gene expression 
(Supplementary Figure S3). 

Next, the in silico transcriptome undergoes RT/frag- 
mentation (Figure IB and C) according to the established 
experimental techniques: in one possible scenario, RNA 
molecules are first reversely transcribed into cDNA — 
adopting poly-t/T or random primers — and afterwards 
fragmented by nebulisation or enzymatic digestion 
(Figure IB and C, left); alternatively, fragmentation is 
carried out by RNA hydrolysis before fragments are 
transcribed into cDNA molecules by random priming 
(Figure IB and C, right). 

The Flux Simulator pipeline also provides optional 
steps to model the final library preparation, involving in 
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Figure 1. Outline of the Flux Simulator pipeline. Provided the genomic sequence of an organism and a representative gene annotation as input, the 
initial step is a transcriptome simulation (A) to assign each transcript a randomised expression level according to general laws of gene expression. 
Subsequently, fragmentation (B) and RT (C) are carried out, either by first hydrolysing RNA and then transcribing the fragments into cDNA 
molecules (B and C, right) or by nebulisation respectively enzymatic digestion after reversely transcribing the entire RNA molecules (B and C, left). 
The simulated molecules of the primary library then get amplified by in silico PCR (D) — optionally after selecting a certain size range — and the final 
library then is subjected to simulated sequencing (E), including potential platform and sequencing chemistry specific error models. Finally, read 
sequences along with their genomic mappings are obtained. 



silico ligation of adapter sequences, fragment size selection 
and PCR amplification (Figure ID). Eventually 
high-throughput sequencing is simulated at the level of 
the single DNA molecule, offering the possibility to 
include platform-specific base calling errors (Figure IE). 
The output comprises the read sequences and their 
genomic locations. 

Physical properties of fragments produced by 
RNA-hydrolysis 

By 'uniform fragmentation' we refer to the sequence- 
independent selection of breakpoints, as implemented for 
instance by DNA nebulisation or RNA hydrolysis, which 
is not to be confused with uniform breakpoint distribu- 
tions along each transcript. In contrast to reports on the 
unequal representation of transcripts by nebulisation, 
fragmentation from RNA hydrolysis is considered to 
produce fragments of comparable lengths (11) without 
positional preferences (1). In this section, we study both 
hypotheses by simulating with our pipeline the experimen- 
tal distributions observed for the so-called spike-in 
controls of known sequences (29). To this end, we 
extend a model proposed for uniform random fragmenta- 
tion processes when the breaking probability depends on 
molecule size (25). 

Paired-end reads generated from spike-in control se- 
quences are particularly well suited to assess differences 
in fragment size distributions, as biases from incomplete 



transcript annotation can safely be excluded, and 
fragment sizes can be estimated straightforwardly by the 
distance between mapped read mates. Figure 2A demon- 
strates that fragments originating from three highly 
covered control sequences having substantially different 
lengths (i.e. 35 838 hydrolysis fragments from the 
11 934nt long Lambdaclonel-1, 472364 fragments from 
the 1429 nt long OBF5, and 21264 fragments from the 
376 nt VATG sequence) also exhibit markedly different 
size distributions: when an arbitrary size of 150nt is 
chosen as the threshold between short and long forms, 
we observe 36% fragments <150nt for the short RNA 
control VATG (Figure 2A, green curve), whereas short 
fragments account for only 22% of the molecules in the 
case of the typical messenger-sized control OBF5 
(Figure 2A, red curve), and their proportion drops to 
15% for the long control sequence Lambdaclonel-1 
(Figure 2A, blue curve). 

The analysed experiment employs a gel segregation step 
in which exclusively fragments with the overall size attri- 
butes shown in Figure 2B are selected. Therefore, one 
cannot computationally cast back to the intermediate 
size distribution of fragments after fragmentation and 
before size selection. However, a previously published 
model for uniformly random fragmentation processes 
in molecules having the same length predicts that the 
sizes of the produced fragments follow a Weibull distri- 
bution — which is specified by two characteristic 
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Insert Size Insert Size Insert Size 

Figure 2. The size of an RNA molecule determines the hydrolysed fragments" size distribution. (A) Theoretical and experimentally measured 
probability distributions of three control sequences with significantly distinct lengths: VATG (376 nt long, green lines), OBF5 (1429 nt, red lines) 
and Lambdaclonel-1 (11 934 nt, blue lines). The solid lines represent frequencies observed by the reads of an experiment, whereas the dashed lines are 
simulated results, obtained by the procedure depicted in B and C. Lambdaclonel-1 — the longest control sequence — exhibits relatively less short 
fragments (blue curve), whereas VATG — the shortest control sequence — shows a comparatively lower fraction of long fragments (green curve). (B) 
The distribution used for simulated size filtering (solid line) is obtained by adjusting the insert size distribution observed across all sequences of the 
experiment (dashed-dotted line) with the combined distribution of insert sizes before filtering (dashed line), as estimated by simulated hydrolysis of 
the three spike-ins. During simulated size selection, weighted subsampling according to the filter distribution (solid line) is applied to the fragment 
size distributions (C) to derive the distributions predicted by the simulation after size filtering (A). (C) Computational prediction of fragment size 
distributions obtained from either control sequence after simulated hydrolysis, i.e. before size selection. According to the employed model, fragment 
sizes fall along Weibull distribution of same scale (= 200nt) but different shapes (8 = 2.6, 3.2 and 4.1 for VATG, OBF5 and Lambdaclonel-1, 
respectively). Subjecting these simulated fragments to in silico size selection (B) can reproduce the differences observed in experimental results for the 
investigated controls (A). 



parameters, the shape (§) and the scale (r|). According to 
Figure 2B we conducted an exhaustive search within 
the relevant parameter space followed by simulated size 
selection (Supplementary Materials and Methods) and 
we found that the differences observed for fragment 
sizes can be qualitatively reproduced employing a 
constant decay rate (r| = 200 nt), with the further prescrip- 
tion that the shape parameters depend logarithmically on 
the molecule length (Supplementary Figure S4). 

With our parameterised hydrolysis model (r\ = 200 nt, 
6-2.6 for VATG, 8-3.2 for OBF5 and 8-4.1 for 
Lambdaclonel-1) we then investigated the abundance 
distribution of fragments observed along transcript 
bodies. To avoid biases that have been demonstrated 
to impact on the ends of fragments in the considered 
experimental protocol (7,8), we focused during our 
analysis on the distribution of fragment midpoints 
along the RNA molecule they have been derived from. 
The top panels of Figure 3 show the density of such 
fragment centres produced by in silico hydrolysis along 
the three transcript bodies of Lambdaclonel-1, OBF5 
and VATG (primary axis), segregated by the respective 
fragment size (secondary axis). The corresponding 
bottom panels depict the experimental outcome, which 
is sensitive to additional influences from other steps (e.g. 
size selection). 

Albeit there are differences, the positional biases pre- 
dicted by the hydrolysis simulation reproduce qualita- 
tively the patterns of fragment concentrations observed 
in the experiment: the short VATG control exhibits 
three such distinct points (Figure 3, left), whereas the 
mRNA-sized OBF5 control shows seven fragment accu- 
mulations (Figure 3, centre) — and in both cases such 
points are located with remarkable symmetry about the 



centre of the reference molecule. Density fluctuations of 
Lambdaclonel-1 (Figure 3, right), the longest of the 
spike-in sequences considered, fall below the resolution 
limit of the diagram (Supplementary Figure S5). 

Convolution of physical with chemical biases 

After elucidating positional preferences caused by physical 
attributes of RNA molecules, we set off to establish com- 
putational models for capturing biases caused by a mol- 
ecule's sequence composition. Some sensitivity of 
RNA-Seq coverage to the GC content had already been 
noted earlier (6), especially in protocols involving PCR- 
amplification (17). In agreement with these previous 
studies we found that empirical PCR amplification effi- 
ciency can be appropriately modelled by a Gaussian dis- 
tribution centred around a GC content of 50% 
(mean = 0.5, SD 0.1; Supplementary Figure S6). 

In the case studies of spike-in sequences described in the 
previous section, we assessed the correlation between the 
number of fragments covering a certain position and the 
GC content in a window of 192 nt (the mean fragment 
size) centred at that position (Figure 4, top panels): for 
the Lambdaclonel-1 and the OBF5 controls we found a 
high correlation (Pearson coefficient of 0.91 and 0.97, re- 
spectively) between binned GC fraction and the respective 
fragment coverage, whereas in the VATG control both 
attributes strongly anti-correlated (Pearson coefficient 
—0.88). These apparently contradictory observations 
cannot be satisfactorily explained just by a significantly 
larger range of GC content in the former two controls 
(ranging from —30% to >50% GC) as opposed to a 
quite tight spectrum (39^15% GC) in the latter case. 

The reasons behind this seemingly paradoxical depend- 
ence on GC content become clearer when considering 
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Figure 3. RNA hydrolysis produces characteristic patterns of fragment accumulations. Distribution of fragment midpoints along RNA control 
sequences added to an RNA-Seq experiment that involves fragmentation by hydrolysis. The x-axis resolves the positions of fragment centres along 
the RNA control molecules VATG (right, 376 nt), OBF5 (centre, 1429nt) and Lambdaclonel-1 (left, 11 934nt long), whereas the i'-axis segregates the 
obtained fragments according to their lengths. Data are shown as density scatter plots, with observed frequencies increasing from blue to red to 
yellow. The top panels show the distributions predicted by simulation employing the hydrolysis model implemented in the Flux Simulator, whereas 
the corresponding experimental outcome is shown in the bottom panels. Distinct points of fragment accumulations are notable in the short reference 
VATG and in the mRNA-sized control OBF5. 



GC-biases together with positional biases caused by frag- 
mentation (Figure 4, bottom panels): in Lambdaclonel-1 
and OBF5 fragments, coverage (red curve) declines where 
GC content (blue curve) drops; in VATG (Figure 4, right 
bottom panel), on the other hand, GC content shows a 
drop about the centre of the molecule where — consistent 
with our hydrolysis model (Figure 3) — the mutual over- 
lap of fragment accumulations causes a coverage peak 
(Figure 4, left bottom panel). Similar observations hold 
for other control sequences from the same experiment. 
Interestingly, the transcript AGP, which has a length 
similar to that of VATG (325 nt versus 376 nt), exhibits — 
in contrast to VATG — a pronounced dependency of 
fragment coverage on GC: this is due to the fact that in 
the case of AGP, GC-distribution along the molecule and 
positional preferences of hydrolysis mutually amplify about 
the molecule's centre (Supplementary Figure S7). 

Sequence-selectivity at the ends of sequenced fragments 

RNA-Seq is known to introduce biases not only in 
relation to the fraction of G and C nucleotides present 
in a sequence, but also for certain nucleotides towards 
the ends of a sequenced fragment, manifesting in motifs 
of bases preferred at specific positions (7). In agreement 
with earlier reports that predict fragment end positions by 
employing correspondingly observed motifs modelled as 



position weight matrices (PWMs), we found only 
moderate correlations between the observed fluctuations 
and the predictions based on PWMs (8). Supplementary 
Figure S8 depicts the effect of sequence-selectivity — which 
has been attributed to the enzyme-subtrate kinetics of 
randomly-primed reverse transcription process (7) and/ 
or adapter ligation to cDNA molecules (16). 

To alleviate such biases, a modified hydrolysis protocol 
is sometimes performed, where the ligation of adapter se- 
quences to the RNA molecule comes before RT and the 
latter is carried out with primers specifically targeting 
anchor sequences in the adapters. Variants of such 
'RNA-ligation' based methods differ in the way adapter 
sequences are attached to the respective 5'- and 3'-ends 
of RNA fragments, e.g. by the use of standard RNA 
ligase (17) or by poly- A polymerase and special circular 
ligase (18). 

Both methods have been reported to improve the uni- 
formity of read coverage along transcripts. Here we 
evaluate our computational models by analysing the dif- 
ference between simulations with PWMs (derived from 
RNA-Seq data sets produced by the standard hydrolysis 
protocol) and the experimental results of the RNA- 
ligation method called FRT-Seq (as RT is performed on 
an Illumina flowcell), in the case of a human placental 
sample. In addition to the difference in substrate when 
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Figure 4. Convolution of hydrolysis biases with sensitivity to GC content. The GC content of the three spike-in controls has been measured per 
position considering a surrounding window corresponding to the mean observed insert size (192 nt). To estimate coverage, the number of fragments 
that include a respective position is taken into account. Top panels: for each of 20 equally sized GC bins (x-axis), a box plot summarizes the 
distribution of fragment coverage (v-axis). Within their specific spectrum of GC content, Lambdaclonel-1 and OBF5 show a positive correlation of 
coverage with GC content, whereas for the VATG control, GC content and coverage anti-correlate. Bottom panels: the GC content (blue curve, 
normalised to [0;1] for each transcript) along the molecules (x-axis) is shown in comparison with the relative fragment coverage (red curve). 
Lambdaclonel-1 and OBF5 exhibit synchrone variations in GC content and coverage, whereas the relative GC content of VATG scores minimal 
around the molecule's centre, where the mutual overlap of hydrolysis products provokes a peak in coverage (Figure 3). 



ligating adapters, the FRT-protocol is PCR-free and 
employs no specific size selection step (17). 

Figure 5 shows that the PWMs derived from read se- 
quences differ substantially in the two cases. The informa- 
tion content, a logarithmic measure of the deviation from 
uniformly distributed nucleotide frequencies that corres- 
pond in the depicted sequence logos to the height of a 
stack of letters at every position, describes less severe 
biases in the FRT-Seq protocol (Figure 5A) than in the 
standard hydrolysis protocol (Figure 5B). Consequently, 
we observe a higher degree of transcript coverage in the 
FRT experiment (Figure 5A versus B, black bars). The 
trend can be reproduced in silico when providing the cor- 
responding PWM and de-activating simulated PCR and 
size selection (Figure 5A and B, grey bars). Differences 
between the simulated and the experimental data set are 
mainly due to different mapping redundancies: on average 
~ 1 .82 mappings per read are found for the experimental 
data set, whereas in the simulated data to every read 
exactly one mapping can be assigned. 

Simulation of generic RNA-Seq experiments 

We then employed the entire Flux Simulator pipeline to 
assess how well the models described so far — when 
combined — can mimic the overall distribution of reads 
along RNA molecules in populations of cellular tran- 
scripts. To allow the simulation of realistic transcript 



expression levels, we developed a transcriptome simulator: 
it is based on Zipf s Law — which governs gene expres- 
sion — and modified according to further empirical obser- 
vations from RNA-Seq experiments (Supplementary 
Figure S3 and Supplementary Table S2). 

Since sequencing-by-synthesis protocols produce reads 
whose first nucleotide identifies the fragment edge (i.e. the 
breakpoint) and whose mapping directionality further 
reveals the nature of the fragment edge (i.e. whether it 
constitutes a 5'- or 3'-end, Supplementary Figures SI 
and S9), we separately focused on breakpoint distributions 
for reads mapping in sense and in antisense directions, 
thus preventing influences on sequence coverage by differ- 
ent read lengths. In our benchmark, we investigated 
four different experiments (i.e. the last four rows in 
Supplementary Table SI) that differ in species/tissue of 
the sequenced RNA, sample preparation and sequencing 
platform (9,11,28). For each data set, we provide a 
parameterised in silico model (Supplementary Table S3), 
and we compare the experimental observation with the 
simulation. 

The results of our comparisons are summarized 
in Figure 6. As a general trend, they reproduce earlier 
observations (28) that reads from 5'-ends of fragments 
(sense mappings) generally increase close to the 5'-end of 
transcripts and decrease close to the 3'-end, whereas the 
3'-ends of fragments (antisense mappings) exhibit an 
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Figure 5. Correlation of sequence biases with transcript coverage in a 
standard hydrolysis experiment (A) and an RNA-ligation method (B). 
The panels above the bar plots depict sequence logos that capture 
biases as observed in the first six bases of sequenced reads, where the 
height of the letters are scaled according to the information content 
(IC), a logarithmic measurement that captures the divergence from 
equally distributed bases. The bar histogram in (A) and (B) show the 
corresponding transcript coverage by experimentally obtained reads 
(black bars), respectively, by correspondingly simulated reads (grey 
bars). The standard hydrolysis protocol exhibits stronger sequence se- 
lectivity that produces read stacks at specific positions and thereby 
reduces the overall transcript coverage. 



inverse effect (Figure 6). Our simulation reveals that the 
phenomenon is due to the fragmentation step, given that 
5'/3'-ends of transcripts are also naturally included as 5'/ 
3'-ends of some of the fragments produced from them; 
therefore, the fraction of transcription start sites preserved 



in the fragment population is higher for short transcripts 
that exhibit a comparatively lower number of breakpoints 
(Figure 6, left panels). A corresponding increase of anti- 
sense mappings is predicted by our simulations at the 
3'-end of the transcripts, however, corresponding reads 
fall into the poly-A tail not included in Figure 6. 

In Figure 6A, we assessed the distribution of reads for the 
hydrolysis protocol investigated in detail in a previous 
section. In agreement with earlier reports (1), the 
transcript-specific biases we pinpoint are not identifiable 
when sufficiently heterogeneous molecule groups are con- 
sidered together (1 1). Only the small reduction of reads next 
to the 5'-bin in transcripts with <2000 nt reflects a cumula- 
tive effect of fragments that fall along sufficiently similar 
Weibull distribution (left and centre panel of Figure 6A). 

In Figure 6B, we compare these results with a recent 
adaptation of the hydrolysis protocol that has been 
employed to produce the Illumina Body Map 2 (accession 
number ERP000546 in the European Nucleotide Archive). 
The experiment produced reads exclusively from the sense 
strand of RNAs obtained from a mixture of 16 tissues 
(libraries HCT20170 and HCT20173). Therefore, only 
spurious amounts of antisense mappings can be 
observed which, in agreement with previous reports 
about antisense transcription, can be found especially at 
the 5'-/3'-ends of long transcripts (Figure 6B, right panel). 

In this protocol, the longer reads (100 nt) — and there- 
fore also larger fragments — cause a more accentuated 
drop of read density towards the 5'-end. Moreover, the 
use of a so-called 'ribofree' technology allows extracting 
RNA species without relying on the presence of a poly-A 
tail. We therefore expect the downstream ends of 3'-most 
fragments — which would be represented by antisense 
mappings absent from this experiment — are at (or close 
to) the respective cleavage sites. Consequently, we observe 
the frequency of sense mappings to decrease at positions 
closer than the average fragment size to 3'-end of the 
transcribed sequence. The effect is marginally stronger in 
experimental data than when reproduced in silico, 
indicating that additional mechanisms might play a role 
here. However, our simulations are able to qualitatively 
reproduce that 3'-regions which suffer from such read 
under-representation are comparatively larger in short 
and medium-sized transcripts (Figure 6B, left and centre 
panel). 

To simulate the experiment depicted in Figure 6C, we 
replaced the uniformly random fragmentation model by 
enzymatic digestion with DNAsel (9) and moved it after 
RT, which in this protocol has been realised by poly-t/T 
priming on the original transcript templates. Our 
models correctly predict the under-representation of 
5'-end information in poly-c/T primed RT due to the 
simulated template loss of the reverse transcriptase 
during first-strand synthesis (1) — and an increasing 
impact of the bias from short to long transcripts (Figure 
6B, left versus centre versus right panel). 

In Figure 6D, we compared simulation results to with an 
experiment employing cDNA nebulisation in contrast to 
fragmentation by DNAsel (28). Our model of mechanical 
breakage is able to reproduce the known bias of read dis- 
tribution towards the centres of the transcripts (28), 
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Figure 6. Comparison of simulated reads with experimental evidence in different sequencing protocols. For each experiment, transcripts from a 
reference annotation of the corresponding species have been classified into short (<1000nt, left panels), intermediate (1000-2000nt, centre panels), 
and long forms (>2000nt, right panels). Red and orange bars show reads from the experiment that align in sense and antisense, respectively, to the 
directionality of transcription, the corresponding in silico results are shown as dark and light blue bars, respectively. (A) Read tag distributions from 
an RNA hydrolysis protocol in M. musculus sequenced on the Illumina GA2 platform. (B) A different hydrolysis experiment carried out with the 
recent HiSeq2000 technology (Illumina), producing longer reads that exclusively map in sense orientation, so called 'dir RNA-Seq'). (C) A 
complementary Illumina experiment employing poly-ifT primed RT and subsequent DNAse digestion of the (poly-A + ) transcriptome of S. 
cerevisisae. (D) Results from an experiment in A. thaliana where poly-iff primed RT products are fragmented by nebulisation. 



especially in shorter transcripts that break few times 
(Figure 6D, left and centre panels); multiple recursive 
breaks along the body of long transcripts thin out these 
points of sharp breakpoint accumulation (Figure 6D, 
right panel). 



DISCUSSION 

We present the Flux Simulator, a framework for 
simulating RNA-Seq experiments in silico that breaks 
down heterogeneous sample preparation protocols into 
their atomic steps (Figure 1). For each step, we provide 



tunable computational models with a minimal set of free 
parameters, whose values can be estimated by correspond- 
ing quantities observed in real experiments. The Flux 
Simulator pipeline implements these steps as modules 
that can be flexibly joined: this structure allows simulation 
of arbitrary protocols. In the present article we focus on 
several protocols employed for the currently popular 
Illumina and Roche 454 sequencing platforms, but the 
modularity of our simulation platform allows analysis of 
arbitrary sequencing technologies, as those announced for 
the future by the manufacturers Ion Torrent (34) and 
Pacific Biosciences (35). Although our models are largely 
of approximate nature and describe in a simplified way the 
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underlying complex chemistry, we show that our simula- 
tions reproduce fairly well the read distributions observed 
in practice (Figure 6). 

If we accept that our bioinformatics models capture 
the main origins of the experimental biases, our simula- 
tion enables us to investigate intermediate stages of 
sequencing protocols — usually hidden layers of 
RNA-Seq (Figures 2-4). Specifically, we give computa- 
tional and experimental evidence as to why insert size dis- 
tributions obtained by hydrolysis differ substantially 
between transcripts of different lengths (Figure 2). In the 
light of the uniform random fragmentation model we de- 
veloped, the dependence of the RNA molecules' geometry 
on their length can be interpreted as shorter molecules 
being more linearised when hydrolysed, whereas longer 
RNA polymers — in spite of strongly denaturing condi- 
tions — still tend to form higher order structures. 
Therefore, size filtering alters the way transcripts are rep- 
resented in the library as a function of the length of the 
original RNA molecule. 

In addition, our models show why fragments ob- 
tained by sequence-independent fragmentation processes, 
as for instance cDNA nebulisation or RNA hydrolysis, 
are not uniformly distributed along the fragmented 
molecule, but occur more frequently at rather specific 
points: the ends of nebulised fragments accumulate at 
the midpoints of recursively split molecules (Figure 6D), 
whereas fragment density obtained by RNA hydrolysis 
propagates from a transcript's ends towards its centre in 
patterns produced by characteristic Weibull distribution 
of the obtained insert sizes (Figure 3). Onto these 
patterns one has to superimpose sequence-specific biases 
(Figures 4 and 5). If heterogeneous transcripts are con- 
sidered together, however, the recognition of these biases 
on large scale is complicated (Figure 6). 

As for the computational analysis of RNA-Seq experi- 
ments, we consider our simulation-based studies as a 
serious motivation to debunk the widespread belief that 
all biases should affect the interpretation of data nega- 
tively: in fact, well-understood biases of systematic 
nature are valuable as additional sources of information. 
Therefore, we are convinced that the critical evaluation of 
experiments mimicked in silico will have an increasing 
impact on design and evaluation of bioinformatics 
approaches to RNA-Seq. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1-3, Supplementary Figures 1-9, 
Supplementary Methods and Supplementary References 
[36-42]. 



AVAILABILITY 

The Flux Simulator is implemented in platform-portable 
Java code (JDK compliance 1.6), source code and 
binaries are freely available via the webpage http://flux 
.sammeth.net. 
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